OcrV1, Main, Exploration, bibRecord, 000041

Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents

Internal identifier: 000041 (Main/Exploration); previous: 000040; next: 000042

Authors: Stefan Senger; Luca Bartek; George Papadatos; Anna Gaulton

Source:

Journal of Cheminformatics [1758-2946]; 2015.

RBID : PMC:4594083

Abstract

Background

First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.

Results

When looking at the percentage of chemical structures successfully extracted from a set of patents, using SciFinder as our reference, 59 and 51 % were also found in our comparison in SureChEMBL and IBM SIIP, respectively. When performing this comparison with compounds as starting point, i.e. establishing if for a list of compounds the databases provide the links between chemical structures and patents they appear in, we obtained similar results. SureChEMBL and IBM SIIP found 62 and 59 %, respectively, of the compound-patent pairs obtained from Reaxys.
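
The percentages above are recall-style figures: the share of reference compound-patent pairs that an automatically generated database also contains. The following is a minimal sketch of that calculation, assuming each pair is represented as a (compound key, patent number) tuple. The function name `recall_percent` and the sample identifiers are hypothetical placeholders, not data or code from the study.

```python
def recall_percent(reference_pairs, candidate_pairs):
    """Percentage of reference pairs that also occur in the candidate set."""
    reference = set(reference_pairs)
    if not reference:
        raise ValueError("reference set is empty")
    found = reference & set(candidate_pairs)
    return 100.0 * len(found) / len(reference)


# Hypothetical placeholder pairs (compound key, patent number); not study data.
reference = {
    ("CMPD-A", "PAT-1"),
    ("CMPD-B", "PAT-1"),
    ("CMPD-C", "PAT-2"),
    ("CMPD-D", "PAT-3"),
}
extracted = {
    ("CMPD-A", "PAT-1"),
    ("CMPD-C", "PAT-2"),
    ("CMPD-X", "PAT-9"),  # present only in the candidate set; ignored by recall
}

print(recall_percent(reference, extracted))  # 50.0
```

Pairs found only in the candidate database do not affect this figure, which is why it measures coverage of the reference rather than precision.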

Conclusions

In our comparison of automatically generated vs. manually curated patent chemistry databases, the former successfully provided approximately 60 % of links between chemical structure and patents. It needs to be stressed that only a very limited number of patents and compound-patent pairs were used for our comparison. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents. The challenges we have encountered whilst performing this study highlight that more needs to be done to make such assessments easier. Above all, more adequate, preferably open access to relevant ‘gold standards’ is required.

Electronic supplementary material

The online version of this article (doi:10.1186/s13321-015-0097-z) contains supplementary material, which is available to authorized users.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4594083

DOI: 10.1186/s13321-015-0097-z
PubMed: 26457120
PubMed Central: 4594083

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents</title>
<author><name sortKey="Senger, Stefan" sort="Senger, Stefan" uniqKey="Senger S" first="Stefan" last="Senger">Stefan Senger</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Bartek, Luca" sort="Bartek, Luca" uniqKey="Bartek L" first="Luca" last="Bartek">Luca Bartek</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Papadatos, George" sort="Papadatos, George" uniqKey="Papadatos G" first="George" last="Papadatos">George Papadatos</name>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Gaulton, Anna" sort="Gaulton, Anna" uniqKey="Gaulton A" first="Anna" last="Gaulton">Anna Gaulton</name>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">26457120</idno>
<idno type="pmc">4594083</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4594083</idno>
<idno type="RBID">PMC:4594083</idno>
<idno type="doi">10.1186/s13321-015-0097-z</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000038</idno>
<idno type="wicri:Area/Pmc/Curation">000038</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000019</idno>
<idno type="wicri:Area/Ncbi/Merge">000242</idno>
<idno type="wicri:Area/Ncbi/Curation">000242</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000242</idno>
<idno type="wicri:Area/Main/Merge">000039</idno>
<idno type="wicri:Area/Main/Curation">000041</idno>
<idno type="wicri:Area/Main/Exploration">000041</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents</title>
<author><name sortKey="Senger, Stefan" sort="Senger, Stefan" uniqKey="Senger S" first="Stefan" last="Senger">Stefan Senger</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Bartek, Luca" sort="Bartek, Luca" uniqKey="Bartek L" first="Luca" last="Bartek">Luca Bartek</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Papadatos, George" sort="Papadatos, George" uniqKey="Papadatos G" first="George" last="Papadatos">George Papadatos</name>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Gaulton, Anna" sort="Gaulton, Anna" uniqKey="Gaulton A" first="Anna" last="Gaulton">Anna Gaulton</name>
<affiliation><nlm:aff id="Aff2">European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD UK</nlm:aff>
<wicri:noCountry code="subfield">Cambridge CB10 1SD UK</wicri:noCountry>
</affiliation>
</author>
</analytic>
<series><title level="j">Journal of Cheminformatics</title>
<idno type="eISSN">1758-2946</idno>
<imprint><date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>First public disclosure of new chemical entities often takes place in patents, which makes them an important source of information. However, with an ever increasing number of patent applications, manual processing and curation on such a large scale becomes even more challenging. An alternative approach better suited for this large corpus of documents is the automated extraction of chemical structures. A number of patent chemistry databases generated by using the latter approach are now available but little is known that can help to manage expectations when using them. This study aims to address this by comparing two such freely available sources, SureChEMBL and IBM SIIP (IBM Strategic Intellectual Property Insight Platform), with manually curated commercial databases.</p>
</sec>
<sec><title>Results</title>
<p>When looking at the percentage of chemical structures successfully extracted from a set of patents, using SciFinder as our reference, 59 and 51 % were also found in our comparison in SureChEMBL and IBM SIIP, respectively. When performing this comparison with compounds as starting point, i.e. establishing if for a list of compounds the databases provide the links between chemical structures and patents they appear in, we obtained similar results. SureChEMBL and IBM SIIP found 62 and 59 %, respectively, of the compound-patent pairs obtained from Reaxys.</p>
</sec>
<sec><title>Conclusions</title>
<p>In our comparison of automatically generated vs. manually curated patent chemistry databases, the former successfully provided approximately 60 % of links between chemical structure and patents. It needs to be stressed that only a very limited number of patents and compound-patent pairs were used for our comparison. Nevertheless, our results will hopefully help to manage expectations of users of patent chemistry databases of this type and provide a useful framework for more studies like ours as well as guide future developments of the workflows used for the automated extraction of chemical structures from patents. The challenges we have encountered whilst performing this study highlight that more needs to be done to make such assessments easier. Above all, more adequate, preferably open access to relevant ‘gold standards’ is required.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s13321-015-0097-z) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Bregonje, M" uniqKey="Bregonje M">M Bregonje</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Chambers, J" uniqKey="Chambers J">J Chambers</name>
</author>
<author><name sortKey="Davies, M" uniqKey="Davies M">M Davies</name>
</author>
<author><name sortKey="Gaulton, A" uniqKey="Gaulton A">A Gaulton</name>
</author>
<author><name sortKey="Hersey, A" uniqKey="Hersey A">A Hersey</name>
</author>
<author><name sortKey="Velankar, S" uniqKey="Velankar S">S Velankar</name>
</author>
<author><name sortKey="Petryszak, R" uniqKey="Petryszak R">R Petryszak</name>
</author>
<author><name sortKey="Hastings, J" uniqKey="Hastings J">J Hastings</name>
</author>
<author><name sortKey="Bellis, L" uniqKey="Bellis L">L Bellis</name>
</author>
<author><name sortKey="Mcglinchey, S" uniqKey="Mcglinchey S">S McGlinchey</name>
</author>
<author><name sortKey="Overington, Jp" uniqKey="Overington J">JP Overington</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Akhondi, Sa" uniqKey="Akhondi S">SA Akhondi</name>
</author>
<author><name sortKey="Klenner, Ag" uniqKey="Klenner A">AG Klenner</name>
</author>
<author><name sortKey="Tyrchan, C" uniqKey="Tyrchan C">C Tyrchan</name>
</author>
<author><name sortKey="Manchala, Ak" uniqKey="Manchala A">AK Manchala</name>
</author>
<author><name sortKey="Boppana, K" uniqKey="Boppana K">K Boppana</name>
</author>
<author><name sortKey="Lowe, D" uniqKey="Lowe D">D Lowe</name>
</author>
<author><name sortKey="Zimmermann, M" uniqKey="Zimmermann M">M Zimmermann</name>
</author>
<author><name sortKey="Jagarlapudi, Sarp" uniqKey="Jagarlapudi S">SARP Jagarlapudi</name>
</author>
<author><name sortKey="Sayle, R" uniqKey="Sayle R">R Sayle</name>
</author>
<author><name sortKey="Kors, Ja" uniqKey="Kors J">JA Kors</name>
</author>
<author><name sortKey="Muresan, S" uniqKey="Muresan S">S Muresan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Heller, S" uniqKey="Heller S">S Heller</name>
</author>
<author><name sortKey="Mcnaught, A" uniqKey="Mcnaught A">A McNaught</name>
</author>
<author><name sortKey="Pletnev, I" uniqKey="Pletnev I">I Pletnev</name>
</author>
<author><name sortKey="Stein, S" uniqKey="Stein S">S Stein</name>
</author>
<author><name sortKey="Tchekhovskoi, D" uniqKey="Tchekhovskoi D">D Tchekhovskoi</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Southan, C" uniqKey="Southan C">C Southan</name>
</author>
<author><name sortKey="Varkonyi, P" uniqKey="Varkonyi P">P Varkonyi</name>
</author>
<author><name sortKey="Boppana, K" uniqKey="Boppana K">K Boppana</name>
</author>
<author><name sortKey="Jagarlapudi, Sarp" uniqKey="Jagarlapudi S">SARP Jagarlapudi</name>
</author>
<author><name sortKey="Muresan, S" uniqKey="Muresan S">S Muresan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Hattori, K" uniqKey="Hattori K">K Hattori</name>
</author>
<author><name sortKey="Wakabayashi, H" uniqKey="Wakabayashi H">H Wakabayashi</name>
</author>
<author><name sortKey="Tamaki, K" uniqKey="Tamaki K">K Tamaki</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tyrchan, C" uniqKey="Tyrchan C">C Tyrchan</name>
</author>
<author><name sortKey="Bostrom, J" uniqKey="Bostrom J">J Boström</name>
</author>
<author><name sortKey="Giordanetto, F" uniqKey="Giordanetto F">F Giordanetto</name>
</author>
<author><name sortKey="Winter, J" uniqKey="Winter J">J Winter</name>
</author>
<author><name sortKey="Muresan, S" uniqKey="Muresan S">S Muresan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<affiliations><list></list>
<tree><noCountry><name sortKey="Bartek, Luca" sort="Bartek, Luca" uniqKey="Bartek L" first="Luca" last="Bartek">Luca Bartek</name>
<name sortKey="Gaulton, Anna" sort="Gaulton, Anna" uniqKey="Gaulton A" first="Anna" last="Gaulton">Anna Gaulton</name>
<name sortKey="Papadatos, George" sort="Papadatos, George" uniqKey="Papadatos G" first="George" last="Papadatos">George Papadatos</name>
<name sortKey="Senger, Stefan" sort="Senger, Stefan" uniqKey="Senger S" first="Stefan" last="Senger">Stefan Senger</name>
</noCountry>
</tree>
</affiliations>
</record>
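
The record above uses the `nlm:` and `wicri:` prefixes without declaring them, so a namespace-aware parser rejects it as it stands. The following is a minimal sketch of one way to read it with Python's standard library, assuming it is acceptable to wrap the record in a helper element that binds those prefixes to placeholder URIs. The `wrapper` element, the `urn:example:` URIs, and the function name `parse_record` are assumptions made for illustration; the excerpt is abridged from the record.

```python
import xml.etree.ElementTree as ET

# Abridged excerpt of the record, kept short for illustration.
RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents</title>
<author><name sortKey="Senger, Stefan" first="Stefan" last="Senger">Stefan Senger</name>
<affiliation><nlm:aff id="Aff1">GlaxoSmithKline, Stevenage, Hertfordshire SG1 2NY UK</nlm:aff>
<wicri:noCountry code="subfield">Hertfordshire SG1 2NY UK</wicri:noCountry>
</affiliation></author>
</titleStmt>
<publicationStmt><idno type="pmid">26457120</idno>
<idno type="doi">10.1186/s13321-015-0097-z</idno>
</publicationStmt></fileDesc></teiHeader></TEI></record>"""

# Placeholder namespace URIs (assumptions, not official identifiers):
# they only serve to bind the prefixes so that the parser accepts the input.
WRAPPER = (
    '<wrapper xmlns:nlm="urn:example:nlm" '
    'xmlns:wicri="urn:example:wicri">{}</wrapper>'
)


def parse_record(xml_text):
    """Return the title, author names, and identifiers from a record string."""
    root = ET.fromstring(WRAPPER.format(xml_text))
    title_stmt = root.find(".//titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        "ids": {
            idno.get("type"): idno.text
            for idno in root.findall(".//publicationStmt/idno")
        },
    }


info = parse_record(RECORD)
print(info["authors"], info["ids"]["doi"])
```

Reading the authors from `titleStmt` only avoids counting them twice, since the record repeats the same author list under `sourceDesc`.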

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000041 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000041 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     PMC:4594083
   |texte=   Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents
}}
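
For several records, the template call above could be generated rather than typed by hand. The following is a minimal sketch, assuming the seven parameters shown are the complete set and that the column alignment of the values is cosmetic. The function name `explor_link` and its argument names are hypothetical; the template and parameter names are taken from the example above.

```python
def explor_link(wiki, area, flux, step, id_type, key, text):
    """Render an 'Explor lien' template call with the parameters shown above."""
    params = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", step),
        ("type", id_type),
        ("clé", key),
        ("texte", text),
    ]
    lines = ["{{Explor lien"]
    # One parameter per line; spacing is simplified relative to the example.
    lines += [f"   |{name}= {value}" for name, value in params]
    lines.append("}}")
    return "\n".join(lines)


link = explor_link(
    wiki="Ticri/CIDE",
    area="OcrV1",
    flux="Main",
    step="Exploration",
    id_type="RBID",
    key="PMC:4594083",
    text="Managing expectations: assessment of chemistry databases "
         "generated by automated extraction of chemical structures from patents",
)
print(link)
```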

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i   -Sk "pubmed:26457120" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server

Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.